FOREWORD


This Indian Standard was adopted by the Bureau of Indian Standards after the draft finalized by the Indian Language
Technologies and Products Sectional Committee had been approved by the Electronics and Information Technology
Division Council.


The process of arriving at a common tagset for Indian languages was set in motion by the project on
Indian Language to Indian Language Machine Translation (ILMT), which started in 2006, was led by IIIT Hyderabad,
involved 11 other academic institutions and was supported by the Ministry of Electronics and Information Technology
(MeitY), Government of India. This led to a special effort to create a national standard for POS tags under BIS.


These standards have been developed through a rigorous process of deliberation, starting from the existing common
tagsets that evolved through the ILMT project and similar standards elsewhere in the world, and considering the linguistic
properties of Indian languages. Linguists and experts working in the field were consulted, and lessons were
drawn from annotators working in major projects related to Indian languages: ILMT (MeitY), ILPOS
(Microsoft), LDCIL (CIIL), and ILCI (MeitY). Several workshops were conducted over a period of three to four
years, in which linguists, computer scientists, language experts and annotators participated and the necessity and
justification for each tag were discussed threadbare.


Standardization of linguistic information is done at different levels and in different phases. The standard for POS
tagging is taken up as the first step in this direction.

The scheme proposed here is extensible, which allows more languages to be added to the current set.


The composition of the Sectional Committee and panel responsible for the preparation of this standard is given in
Annex B.
0 INTRODUCTION


Annotated text corpora are an important resource for advances in NLP research and for developing different language
technologies. Corpora are annotated using a set of tags which mark the linguistic properties of a
word, sentence or discourse. Corpora annotated with various kinds of linguistic information are a critical
resource for language technologies, but creating them involves a large amount of effort and time. It is therefore important to
create corpora which, once created, can be used for various purposes. Standardization of the tag sets and tagging
schemes for annotating various types of linguistic information is extremely important for the creation of consistent
and usable corpora.
The following issues are considered in arriving at appropriate standards in this field:

a) What linguistic information should be annotated in the corpora; and

b) How should it be annotated. 
Point a) above guides the development of a scheme which allows various types of linguistic information to be annotated
in a seamless manner. Linguistic information could be about the syntactic category of each word in a given
sentence (Parts of Speech (POS)), a word’s internal structure (morph analysis), grammatical analysis of a sentence
(sentence structure or syntactic parsing), semantic level information at both the lexical and sentential levels,
discourse structure, pronoun referents (anaphora resolution) etc. Though only one or two types of information
may be annotated initially, it should be possible to add information at other levels on top of
the already annotated data seamlessly at a later stage. Therefore, the design of the scheme shall support this
basic principle. It is also extremely important that the scheme design shall not in any way compromise the
linguistic information to be annotated. In other words, the scheme at any level may choose to annotate
a coarser level of information initially if annotating a finer level poses certain computational concerns; however, it should be
linguistically correct and should not go against the basics of linguistic analysis.
The purpose of undertaking a costly task (in terms of both effort and money), such as creating linguistically
annotated corpora, should also be clear while designing the scheme. The Natural Language Processing (NLP)
community finds such annotated corpora to be a precious resource, as they are used for the following
purposes:

a) Machine learning; 

b) Extracting linguistic properties/grammars; 

c) Language modelling; and 

d) Any other NLP need. 


0.1 PRINCIPLES 


The basic principles for designing the schemes and tag sets for different levels of linguistic analysis form an
important basis for the final outcome. Considering the type of linguistic information identified, the linguistic scene
in India and the purpose of annotation, the following requirements are identified for a
scheme that can be accepted as a standard for annotation:


a) The scheme should be flexible; 

b) The scheme should ensure consistency in the annotation; 

c) The scheme should be annotator friendly; 

d) The scheme should be compatible/mappable to other existing schemes; 
e) The scheme should be applicable across languages; and 


f) It should also be possible to annotate corpora separately for various levels of linguistic information and
merge them to obtain a linguistically richer corpus.
Indian Standard


LINGUISTIC RESOURCES — POS TAG SET FOR 
INDIAN LANGUAGES — GUIDELINES FOR 
DESIGNING TAGSETS AND SPECIFICATION 


1 SCOPE 


This Indian Standard provides guidelines for designing 
POS tagsets and labels for Indian languages. This 
standard also defines Tagsets for the Indian languages 
Bangla, Gujarati, Hindi, Kashmiri, Konkani, Maithili, 
Marathi, Punjabi, Urdu, and Dravidian Languages 
(Telugu, Kannada, Malayalam and Tamil). 


Tagsets for the languages from the North-Eastern 
region are not defined in this standard. However, 
the methodologies for designing tagsets and labels 
specified in this standard are such that they can easily 
be extended to the remaining Indian languages. 


2 REFERENCE 


3 TERMINOLOGY AND SYMBOLS 


3.1 Terminology 


For the purpose of this standard, the following 
definition(s) shall apply. 


3.1.1 Annotation — Linguistic information added 
to primary data, independent of its representation 
[SOURCE: ISO 24612 : 2012, 2.3] 


3.1.2 Case Marker — A marker for the grammatical 
role of a noun/phrase/clause in a sentence. 


3.1.3 Chunks — A chunk is a minimal, non-recursive
constituent unit. Chunks are non-overlapping pieces
of text, each consisting of a head and the other words
which go with it.


3.1.4 Classifier — Classifiers are words or suffixes that 
mark the category of a noun in terms of its semantic 
properties, such as gender, shape, size etc. 


3.1.5 Common Noun — Common nouns are nouns that 
are names of a group of similar objects. 


3.1.6 Coordinators — Coordinators are words that
conjoin any two linguistic objects such as words/
phrases/clauses.


3.1.7 Echowords (ECH) — Echo words are the second
part of a two-word expression. The echo words are
partial reduplication or repetition of the first word,
however, the initial segment or syllable of the repeated
word is replaced by a fixed segment or syllable. For
example, in Hindi, the fixed syllable is ‘waa’ (chay
vay), in Punjabi it is ‘shaa’ (chay-shay), etc.


3.1.8 Extensibility — The ability of a scheme to be
extended flexibly when needed.


3.1.9 Foreign Word (RDF) — A word from a language 
other than the language of the document. 


3.1.10 Intensifier — Expressions that intensify the
meaning of an adverb or an adjective. 


3.1.11 Interjection — An interjection is a short utterance
(sound/word/phrase) which can occur independently 
(normally before a sentence) and conveys emotion. 


3.1.12 Lexeme — An abstract unit generally associated 
with a set of forms (3.14) sharing a common meaning. 
[SOURCE: ISO 24613 : 2008] 


3.1.13 Linguistic Annotation — Annotation that
provides linguistic information about the segments in 
the primary data. 


3.1.14 Linguistic Information — Information that is 
related to linguistic aspects such as POS, word, morph 
analysis, grammatical structures etc. 


3.1.15 Linguistic Properties — Grammatical or 
morphological information about a sentence or word. 


3.1.16 Linguistic Resource — A data resource that
contains texts, lexicons, grammars etc.


3.1.17 Local Word Grouping (LWG) — Local Word
Grouping is a process whereby the inflectional words
are grouped with their respective nouns and verbs. This
is based on Panini’s notion of ‘pada’ and is developed
for computational modelling of Indian languages
(see Natural Language Processing: A Paninian
Perspective by Akshar Bharati et al.).


3.1.18 Matrix Language — Main language in a mixed 
language text. 


3.1.19 Morph Analyser — A computational tool that 
gives the morph analysis of each word. 


3.1.20 Morph Analysis — Analysing the morph features 
of a word. Features such as root, category, gender, 
number, person etc. 
3.1.21 Morpho-syntactic — The internal structure of
words and how they are related to other words in a 
sentence. 


3.1.22 Morphology — Study of the word structures. 


3.1.23 Multi-Word Expressions (MWE) — Multi-Word
Expressions are defined as “idiosyncratic interpretations
that cross word boundaries (or spaces)” [SOURCE: Ivan
A. Sag et al., 2001].


3.1.24 Nloc — Nloc is a category defined in the IL POS
tag scheme to cover an important phenomenon of
Indian languages. Certain expressions such as ‘Upara’
(above/up), ‘nIce’ (below), ‘pahale’ (before), ‘Age’
(front) etc. are content words denoting time and space.
These expressions/words can also occur as part of a
postposition. Hence, they perform two functions. To avoid
any confusion about their role in a particular context, a
separate category has been defined.


3.1.25 Noun — Words that refer to people, places, 
things, ideas, etc. 


3.1.26 Part of Speech (POS) — lexical category or 
word class category assigned to a lexeme (3.1.12) 
based on its grammatical properties. [SOURCE: 
ISO 24613 : 2008]. 


NOTE — Typical parts of speech for European
languages include: noun, verb, adjective, adverb, preposition,
etc.


3.1.27 Pronoun Referents (Anaphora) — Linguistic 
mechanism by which the interpretation of a referring 
expression depends on another expression mentioned 
in the same text or discourse. 


NOTES 


1 The notion of anaphora is more general than that of 
coreference (3.3): the interpretation of anaphora is context- 
dependent, whereas coreference is determined rather rigidly 
independently to its possible use of context. 


2 The term is used in this document in its general sense since,
for instance, no specific distinction is made here with the
notion of cataphora (that is, coreference with a more specific
expression occurring later in a discourse).


[SOURCE: ISO 24617-9 : 2019] 


3.1.28 Proper Noun — Specific names of people,
places, organizations etc.


3.1.29 Punctuation mark (PUNC) — Non-numeric, 
non-alphabetic characters that carry specific meaning 
and serve as separators. [SOURCE: ISO 10324 : 1997] 


3.1.30 Quotatives — Quotatives conjoin a subordinate 
clause to a main verb in a sentence in some languages. 


3.1.31 Semantic Properties — Semantic features such 
as animate, human, young, male, tall, big etc. 


3.1.32 Subordinators — Subordinators conjoin a
dependent element to its head. 


3.1.33 Symbol (SYM) — Symbols, such as dollar sign, 
pound sign, plus, minus etc. 


3.1.34 Syntactic Analysis — Grammatical analysis of 
a sentence. 


3.1.35 Taggers — Tools that automatically mark some 
linguistic information in a text. 


3.1.36 Tagset — Comprehensive set of tags used for the 
morpho-syntactic description of a language. 


NOTE — The ISOcat data category registry is to be used as
the reference for describing a tagset.


[SOURCE: ISO 24611 : 2012] 


3.1.37 Thematic roles/Predicate Argument structure — 
Formal representation of the core semantic content of 
an utterance, consisting of a predicate constant, and its 
arguments. 


NOTES 


1 In classical logic-based semantics, this corresponds to 
predicate argument structures in first-order predicate logic. 


2 One of the arguments can be a variable uniquely identifying 
the instance of the predicate argument structure to allow 
references to it in other predicate argument structures. 

3 The representation of event semantics is subject to many 
variations; some of them, can have separate predicates for each 
semantic role relation. In this case, the predicate argument 
structure of an utterance is the sum of the individual predicate 
semantic role assertions representing the semantic content of 
the utterance. 


[SOURCE: ISO 24617-4:2014] 


3.1.38 Unknown (UNK) — Words in a text that are not
part of the language of the text or are unfamiliar.


3.1.39 Word Sense — Meaning associated with a 
lexeme (3.1.12) in a context. 


NOTE — The ‘river bank’ sense of bank and the ‘financial 
institution’ sense of bank are considered to be two different 
word senses, or lexical units, with the same word form, or 
lexeme (3.1.12). I called him on the radio and Call me a taxi are 
associated to different word senses of the lexeme (3.1.12) call. 
Unrelated senses, as in bank, are called homonyms. Senses 
of the same word form or lexeme which are clearly related 
(and can be difficult to distinguish) are called polysemes, for 
example, Coins with an image of the king, preoccupied with 
body image, evokes a strong mental image. 


[SOURCE: ISO 24617-4 : 2014] 


SECTION 1: LINGUISTIC INFORMATION 
AND DESIGN PRINCIPLES 


4 PRINCIPLES FOR DESIGNING THE 
LINGUISTIC STANDARDS FOR INDIAN 
LANGUAGE CORPORA 


The design of the Linguistic Standards for Corpora
Annotation is based on the following principles. This
standard follows these principles in the context of POS
tagging.

4.1 Principle 1: Generic Tag Sets 


The scheme should not be biased towards any single
language or group of languages. The scheme should work
for all the languages, even for languages beyond the
Indo-Aryan and Dravidian families.


4.2 Principle 2: Layered Approach 


If a text is annotated with rich linguistic information,
the resulting resource makes a wide range of
linguistic properties explicit.
Therefore, it is desirable to annotate as much linguistic 
information as possible in a given text. The linguistic 
information encoded in a text can be quite complex and 
diverse. Capturing all the diverse linguistic information 
at one go makes the task highly complex and, therefore, 
is not a good idea from the manual annotation point 
of view. Also, adopting a semi-automatic approach 
to reduce the manual effort may not work well if one 
decides to annotate complex linguistic information 
at one go. Therefore, a layered approach can be 
considered as a practical approach. In the layered 
approach, the linguistic information can be broken 
up into smaller bites of information and each bite of 
information can be annotated in one layer. Some layers 
may have dependencies over the others and some may 
be independent depending on the type of linguistic 
information one is marking. Following are some of the
layers which need to be included for marking rich
linguistic information in Indian language corpora:


a) Layer 1: Morphology 

b) Layer 2: POS <morphosyntactic> 
c) Layer 3: LWG 

d) Layer 4: Chunks 

e) Layer 5: Syntactic analysis 


f) Layer 6: Thematic roles/predicate argument 
structure 


g) Layer 7: Semantic properties of the lexical items 
h) Layers 8, 9, 10, 11: Word sense, pronoun referents
(anaphora), etc.
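
As an illustration of the layered approach, the following Python sketch keeps each layer as a separate mapping from token position to annotation and merges the layers afterwards. The function name merge_layers, the layer names and the sample annotations are assumptions made for illustration only; this standard does not prescribe a storage format.

```python
from typing import Dict, List


def merge_layers(tokens: List[str], layers: Dict[str, Dict[int, str]]) -> List[dict]:
    """Combine independently produced annotation layers into one record per token."""
    merged = []
    for index, token in enumerate(tokens):
        record = {"token": token}
        for layer_name, annotations in layers.items():
            # A layer may leave some tokens unannotated.
            if index in annotations:
                record[layer_name] = annotations[index]
        merged.append(record)
    return merged


tokens = ["He", "pens", "his", "views"]
layers = {
    # Hypothetical POS layer, annotated first (tags taken from Clause 7).
    "pos": {0: "PRP", 3: "NN"},
    # Hypothetical morphology layer, produced separately.
    "morph": {1: "root=pen", 3: "root=view"},
}
for record in merge_layers(tokens, layers):
    print(record)
```

Because every layer is keyed by token position, a further layer can be added later without touching the layers already annotated.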
4.3 Principle 3: Hierarchy within each Layer 


Tag sets have to be decided considering the possibility
of including more than one level of information. For
example, within POS one can include Verb as a higher
level category and finite or non-finite verbs as subtypes
under verbs.


4.4 Principle 4: Extensibility 


Extensibility is a basic property a tag set scheme should 
have for making it flexible. Extensibility here implies 
that the scheme should be such that if and when a new 
POS category is noticed/observed, it should be possible 
to incorporate it in the system either at the top level 
of hierarchy or as a subcategory of an existing type 
depending on the case. 


4.5 Principle 5: Tag Redundancy 


If a tag is redundant for a language, it should be
deprecated. In other words, the tagset would have the
tag, however, the concerned language need not include
it in its tag set or use it for tagging.


4.6 Principle 6: Guidelines 


Internationally accepted standards and guidelines
such as ISO 639-3, ELRA, EAGLES and ISLE shall be
followed, wherever available, to the extent that they
work for Indian languages. For naming a particular
language, Alpha-3 Codes for the representation of
names of languages as mentioned in ISO 639-3 shall
be used.
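
For illustration, the languages named in the scope of this standard can be mapped to Alpha-3 codes as in the Python sketch below. The codes are taken from ISO 639-3 and not from this standard, and the names ISO_639_3 and language_code are hypothetical.

```python
# Alpha-3 codes from ISO 639-3 for the languages named in Clause 1.
# Note: "kok" is the ISO 639-3 macrolanguage code for Konkani.
ISO_639_3 = {
    "Bangla": "ben",
    "Gujarati": "guj",
    "Hindi": "hin",
    "Kashmiri": "kas",
    "Konkani": "kok",
    "Maithili": "mai",
    "Marathi": "mar",
    "Punjabi": "pan",
    "Urdu": "urd",
    "Telugu": "tel",
    "Kannada": "kan",
    "Malayalam": "mal",
    "Tamil": "tam",
}


def language_code(name: str) -> str:
    """Return the Alpha-3 code for a language name; raises KeyError if unknown."""
    return ISO_639_3[name]


print(language_code("Hindi"))
```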


4.7 Principle 7: Compatibility 


Standards should be mappable to/compatible with 
existing schemes. 


4.8 Principle 8: Applicability 


The standard is designed to handle a wide range of
applications and also to support all types of NLP
research efforts, independent of any particular technology
development approach.


4.9 Principle 9: Annotator friendly 


This is an important principle. The primary purpose 
of the scheme is to manually/semi-automatically 
annotate corpora, a task that will be carried out by 
language experts. However, the task involves decision 
making at every step and if the tag set has several 
categories/tags which can have multiple interpretations 
then this leads to inconsistency and confusion in the 
annotation. Therefore, ambiguities should be avoided 
in the scheme. 


5 POSSIBLE LINGUISTIC INFORMATION TO
BE INCLUDED IN CORPORA ANNOTATION 
STANDARDS 


Although the development of standards for annotation of
Indian language corpora focuses on POS, a layered approach
is preferred, with the long term goal of developing a generic
scheme that can incorporate new kinds of linguistic
information easily at a later stage. Thus, the scheme
for each level should be such that adding the next
level of information is seamless and does not pose
conflicts in the annotation. Since the goal is larger and,
apart from linguistic information, several NLP tasks
need information about the text itself and its structure,
marking the following types of information in a text is
also useful.


5.1 Marking Text Information 
5.1.1 Marking Metadata in a Text 


A corpus consists of several texts. It is important to
maintain some meta level information about each
document in the corpus. Thus, information such as
title, author, source, domain, font type, creator of the
document etc. should be marked on each text. Like
all other annotation tasks, this can also be done either
manually or semi-automatically.
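
A minimal Python sketch of such a metadata record is given below. The class name DocumentMetadata, the field names and the sample values are assumptions for illustration; only the kinds of information (title, author, source, domain, font type, creator) come from the text above.

```python
from dataclasses import asdict, dataclass


@dataclass
class DocumentMetadata:
    """Meta level information kept for each document in a corpus."""
    title: str
    author: str
    source: str
    domain: str
    font_type: str
    creator: str


# Hypothetical values for a single document.
meta = DocumentMetadata(
    title="Sample article",
    author="Unknown",
    source="newspaper",
    domain="science",
    font_type="Unicode",
    creator="annotation team",
)
header = asdict(meta)  # plain dictionary, easy to serialize with the text
print(header)
```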


5.1.2 Marking the Paragraph Boundaries in a Text 


Any text is structured into paragraphs and sentences. 
Automatic identification of these units in a text is 
important for further processing. Therefore, annotating 
paragraph boundaries is also needed. Standard notation 
for marking paragraph boundaries like “<p>” and 
“</p>” can be used for this annotation. 


Sentences and segments in a paragraph/text are also
marked with their respective boundaries. Titles,
bulleted points (incomplete sentences) etc. would be
marked as segments.

Example 1:


<segment> Project Gutenberg Mission Statement </segment>


<segment> To encourage the creation and
distribution of eBooks. </segment>


<sentence>The reason for this is that 
99 percent of the hardware and software a 
person is likely to run into can read and search 
these files. </sentence><sentence>Any other 
system of etext storage is going to fall short of 
an audience of 99 percent.</sentence> 
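
The following Python sketch shows one way of reading such boundary markup back into a list of units. It assumes that the tags are written exactly as <segment> and <sentence> and are not nested; the names UNIT and extract_units are hypothetical.

```python
import re

# Matches <segment>...</segment> or <sentence>...</sentence> (not nested).
UNIT = re.compile(r"<(segment|sentence)>(.*?)</\1>", re.DOTALL)


def extract_units(text: str):
    """Return (unit type, text) pairs with internal whitespace normalized."""
    return [
        (match.group(1), " ".join(match.group(2).split()))
        for match in UNIT.finditer(text)
    ]


sample = (
    "<segment> To encourage the creation and\n"
    "distribution of eBooks. </segment>\n"
    "<sentence>Any other system of etext storage is going to fall short of\n"
    "an audience of 99 percent.</sentence>"
)
for unit_type, unit_text in extract_units(sample):
    print(unit_type, "|", unit_text)
```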
There could be cases where a text has words in a
language other than the matrix language of the text. For
example, a text in Hindi may contain English words
written in Roman script. From the processing point of view, it
is important to mark this information as well.
Example:
<segment>[Hindi text] <Language="English">Raman Effect</Language="English">
[Hindi text] <Language="English">[...]</Language="English">
</segment>
<p><sentence>[Hindi text]
<Language="English">[...]</Language="English">
[Hindi text]</sentence></p>


All of the above information can be marked
automatically with a high degree of accuracy. However,
the task of the manual annotators is to check and correct
the marked information wherever necessary.


5.2 Marking Linguistic Information 


The linguistic information should be marked at various
levels after marking the text information as mentioned
in 5.1. The following levels of linguistic information are
considered crucial for various NLP tasks.


5.2.1 Token/Word Level 


Identification of the tokens itself is a non-trivial task. 
However, one can decide whether to have it as a step 
in the manual annotation of the corpora. The following 
information shall be marked at the token/word level. 


a) Morph Analysis — The morph features of a given
word may be marked. If the word has multiple
morph feature sets, all of them shall be marked.


Example:
pens <root="pen" cat="n" gender="m"
number="pl" person="3">|<root="pen"
cat="v" gender="m" number="sing"
person="3" tense="present" aspect="hab">


b) POS — A word may be tagged for its POS category
in a given sentence.


Example:


I need two <pos="NN">pens</pos="NN"> to
finish this article. He <pos="VBS">pens</pos="VBS">
his views regularly.


c) Word Sense — the appropriate sense of a word in 
a given context may be marked. 


Example: 


I need two <word_sense="pen">pens</word_sense="pen">
to finish this article. He <word_sense="write">pens</word_sense="write">
his views regularly.


d) Any other relevant linguistic information at the 
token level 
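
As a sketch of how the multiple morph feature sets of item a) could be read by a program, the Python fragment below parses an analysis string into one dictionary per alternative. It assumes the notation is written with plain angle brackets, straight double quotes and "|" between alternatives; the names FEATURE and parse_morph are hypothetical.

```python
import re

# One attribute="value" pair inside a feature set.
FEATURE = re.compile(r'(\w+)="([^"]*)"')


def parse_morph(analysis: str):
    """Return one feature dictionary per alternative analysis."""
    return [dict(FEATURE.findall(alternative)) for alternative in analysis.split("|")]


analyses = parse_morph(
    '<root="pen" cat="n" gender="m" number="pl" person="3">|'
    '<root="pen" cat="v" gender="m" number="sing" person="3" '
    'tense="present" aspect="hab">'
)
for features in analyses:
    print(features["cat"], features)
```

The word "pens" yields two analyses, one nominal and one verbal; choosing between them in context is the job of the POS layer.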
5.2.2 Sentence Level 


The following linguistic annotation should be marked 
at the Sentence level: 


a) Chunks 

b) Multi-Word Expressions (MWEs)

c) Local Word Groups (LWGs) 

d) Constituent Structures (phrases) 

e) Grammatical analysis of a sentence (dependency,
constituent, or both) - Grammatical analysis of each
sentence in a corpus in a chosen grammatical
formalism is another type of linguistic information
that can be marked at the sentence level.


f) Named Entities (NE) - Although Proper names 
have the same functions syntactically as common 
nouns, they need to be treated differently for 
various NLP tasks. For example, proper names 
are not always translated in a translation task. 
Thus, identifying Proper names while processing 
a natural language text input is a major step in 
NLP. Corpora annotated with named entities help 
in developing a Named Entity Recognizer (NER). 


g) Predicates and their arguments - All the verbs in a
sentence and their arguments are marked.


Example:
Ram_Agent gave Mohan_Recipient a book
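
The role labels in such an example can be separated from the words as in the Python sketch below. It assumes that the underscore is used only to attach a role label to a word; the function name split_roles is hypothetical.

```python
def split_roles(sentence: str):
    """Return (word, role) pairs; role is None for words without a label."""
    pairs = []
    for token in sentence.split():
        word, separator, role = token.partition("_")
        pairs.append((word, role if separator else None))
    return pairs


for word, role in split_roles("Ram_Agent gave Mohan_Recipient a book"):
    print(word, role)
```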


5.2.3 Discourse Level 


The following types of linguistic information shall be
annotated at the discourse level:


a) Pronoun referents, 
b) Coreference chains, and 


c) Inter-clause relations (connectives, the clauses 
they connect and the type of semantic relations 
these clauses hold to each other), focus, topic 


5.2.4 Any other Linguistic Information 


Any other linguistic information can also be marked 
depending on the application. 


SECTION 2: STANDARDS FOR THE POS 
TAG SET 


6 POS TAG SET STANDARDS 


The POS tag sets for Indian languages are designed 
following a hierarchical approach and with the 
assumptions given in 6.1 to 6.3: 


6.1 Assumptions 


a) POS tagging is not a replacement for morph 
analysis. The detailed morph analysis of a word 
comes from a morph analyser. 


A word in a sentence carries various types of
information, some of which are evident from its
form (number, gender, person, case etc.) and some
of which are contextual (its grammatical function,
or part of speech, in a given context). The part
of speech of a word can be looked at from two
dimensions. One is its possible grammatical
functions in isolation. For example, the English
word ‘pens’ can be both a noun and a verb if we
look at it in isolation. Morph analysers capture this
aspect of a word’s analysis. The other is its
grammatical function in a context, which is what
a POS tagger captures. Therefore, both a
morph analyser and a POS tagger provide the part
of speech of a given lexical item.


However, since a morph analyser looks at a word 
in isolation it provides all of its possible morph 
analyses, including multiple parts of speech. A 
POS tagger, as mentioned above, looks at a word 
in a context (small window) and provides the 
POS of the word in a syntactic context. Therefore, 
one can assume that the task of POS tagging is 
primarily to disambiguate the categories provided 
by the morph analyser and select the appropriate 
category in a context. Given this assumption, 
the POS tag set is designed primarily with the 
grammatical category in mind and any other 
morph features which can be obtained from 
a morph analyser are not included at the POS 
level. This assumption is based on Principle 2 in 
Clause 4. 


b) Text input to the POS tagger has already undergone
a segmentation process. Thus, every token (word)
to be assigned a POS is a single lexical item and is
not a token that internally contains more than one
lexical item as a result of Sandhi. For example,
all Dravidian languages have a highly productive
Sandhi process, so a noun and a verb are often
concatenated into a single word (token in this
case). A token which is internally both a noun
and a verb cannot be assigned either
of the two part-of-speech categories. Such tokens
must be split into two or more tokens, as
the case may be, before a POS tag can be assigned
to each of them. Therefore, the assumption
here is that the POS is to be assigned to a simple
grammatical entity and not a complex one.



Since POS is a lexical level annotation process, 
any unit that involves more than one lexical item 
will not be captured at the POS level. For example, 
the members/elements of any MWEs such as 
the conjunct verb, compounds, reduplications 
etc., will be marked for their individual lexical 
categories. 


c) A Multi-Word Expression (MWE) identifier layer 
would be applied for expressions that span more 
than one word as a single unit (MWEs). The 
layer will be applied after the POS tagging. At 
the POS level, therefore, the individual members 
of such expressions would be annotated for their 
respective lexical categories. 
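
The division of labour between the morph analyser and the POS tagger described in assumption a) can be sketched as follows. The candidate table, the two context rules and all names in this Python fragment are hypothetical and deliberately tiny; a real tagger would be trained on annotated corpora.

```python
from typing import List, Optional

# Hypothetical morph analyser output: all categories of a word in isolation.
MORPH_CANDIDATES = {"pens": ["n", "v"]}

# Toy context cues, used only for this illustration.
NUMERALS = {"two", "three"}
PRONOUNS = {"he", "she"}


def disambiguate(tokens: List[str], i: int) -> Optional[str]:
    """Pick one category for tokens[i] by looking at the previous word."""
    candidates = MORPH_CANDIDATES.get(tokens[i].lower(), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    previous = tokens[i - 1].lower() if i > 0 else ""
    if previous in NUMERALS and "n" in candidates:
        return "n"
    if previous in PRONOUNS and "v" in candidates:
        return "v"
    return None  # undecided: left for the annotator


print(disambiguate("I need two pens".split(), 3))
print(disambiguate("He pens his views".split(), 1))
```

The tagger only selects among the categories the morph analyser has proposed, which is why morph features themselves are kept out of the POS tag set.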


6.2 Hierarchy at POS level 


The hierarchy of tags is directly related to the granularity
of linguistic information. The deeper the hierarchy, the
more fine-grained the linguistic information.
It is advisable to keep the granularity of the
POS at a coarser level. Thus, the hierarchy for most
POS categories would normally be up to two levels.
The maximum depth for the POS tags is three levels
so far.


The basic idea here is to keep the granularity coarse at
the POS level, since finer information about a lexical
item comes either from its formal structure (morph
features), or its semantic properties (such as abstract,
concrete for nouns), or from its syntactic roles (for
example, subject, object), or from its semantic relation
with the other lexical items in a sentence (for example,
agent, theme, modifier etc.). Based on Principle 2 above,
all this information falls in the other layers of linguistic
annotation.


Principle 2 ensures that the linguistic information 
contained in corpora is split rightly and annotated 
comprehensively in various layers in a modular fashion. 


6.3 POS Tag Set for Indian Languages (a Superset) 


Annex A provides a superset of POS tags for Indian 
languages. The POS tags are defined based on the 
Principles given in 4 and the assumptions as given in 
6.1. 


Annex B provides Tag sets for languages Bangla, 
Gujarati, Hindi, Kashmiri, Konkani, Maithili, Marathi, 
Punjabi, Urdu, and Dravidian Languages that are 
developed taking a subset of the Indian Language (IL) 
POS tag set. 


A common tagset is developed for Kannada, 
Malayalam, Tamil, and Telugu (Dravidian Languages). 
A separate table having the same tagset is provided for 
Tamil where the labels are rendered in Tamil as well. 


7 POS TYPES AND THEIR TAGS 


This section describes the classification and categories 
of tags that need to be included in the POS tag set 
schema and how it has to be interpreted by the 
annotators unambiguously. 


It contains 11 top level categories of POS tags and
their respective subcategories.


The language specific tag sets can have all or some of 
the top level of tags. They can also have a top level of 
a category but not its subtype if the language does not 
reflect it. For example, the sub type of verbs as finite 
and nonfinite is not reflected in Hindi at the lexical 
level. It is only at a later stage of local word grouping 
that Hindi reflects the finiteness of its verbs. Therefore, 
the Hindi tag set would not have the second level of 
verb hierarchy. The third subtype has been included for 
very specific cases and with a view that a language may 
choose to keep that level optionally. 


A tag has also been assigned to each category. Language 
specific tag sets can be designed selecting the categories 
and their tags depending on the linguistic nature of the 
language. 


7.1 Noun (N) 


Noun is a top level category with a sub-type level 
of depth 1. It has four subtypes at the next level of 
hierarchy — common noun, proper noun, verbal noun 
and nloc. The tag “N” shall be assigned to all nouns. 


7.2 Common Noun (NN) 


A noun denoting generic expressions is a common 
noun. Thus, boy, girl, house, book, cow, city etc are 
common nouns. The tag “NN” shall be assigned to all 
common nouns. 


7.3 Proper Noun (NNP) 


The tag “NNP” shall be assigned to all Proper Nouns
(nouns which denote person names, organization
names, place names etc.).


7.4 Verbal Noun (NNV) 


Verbal noun category is for those nouns which are
derived from verbs. This category is only applicable
for Tamil and Malayalam verbs and not for any other
Indian language. The tag “NNV” shall be used to tag
Verbal Nouns.


7.5 Nloc—Particular Nouns of Locations — Space and 
Time (NST) 


This category registers the distinctive nature of some 
of the locational nouns which also function as part of a 
complex postposition. The tag NST shall be used to tag 
Nloc category words. 


The category Nloc covers an important phenomenon of
Indian languages. Certain expressions such as ‘Upara’
(above/up), ‘nIce’ (below), ‘pahale’ (before), ‘Age’
(front) etc. are content words denoting time and space.
These expressions, however, are used in various ways.


NOTES 

1 A detailed discussion on the introduction of Nloc and NST 
is given in ILMT POS tag guidelines [bibliography reference 
here]. 

2 NST tag SHALL be used only for those time/place 
expressions which perform both the noun and relation marker 
function. Time/place expressions which are NOT ambiguous 
in their function (such as kala (yesterday), Aj (today), idhar 
(here), udhar (there)) SHALL not be marked by NST tag. 
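

One way to read Note 2 is as a lexicon check. The following is a minimal sketch in Python, assuming a hand-made word list built from some of the examples in this clause; the names NST_WORDS and is_nst are hypothetical and are not defined by this standard. It only decides whether NST applies and says nothing about which other tag an unambiguous expression receives.

```python
# Hypothetical lexicon: expressions named in 7.5 as acting both as a
# noun and as part of a relation marker (romanized as in the text).
NST_WORDS = {"Upara", "pahale", "Age"}


def is_nst(word):
    """Return True only if the word is listed in the NST lexicon."""
    return word in NST_WORDS


# 'idhar' and 'udhar' are given in Note 2 as unambiguous expressions.
for w in ["Upara", "Age", "idhar", "udhar"]:
    print(w, "NST" if is_nst(w) else "not NST")
```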


7.6 Pronoun (PR) 


Linguistically a pronoun is a variable and functionally 
it is a noun. Tagging pronouns as a category separate 
from nouns will be helpful for anaphora resolution 
tasks. Pronouns shall be tagged as “PR” at the top level. 


The category Pronoun has five subtypes as listed below: 


a) Personal pronouns (for example, I, you, he, she 
etc.); 


b) Reflexive (for example, myself, herself, himself 
etc.); 


c) Relative (for example, who, which, when etc); 
d) Reciprocal (for example, each other etc); and 


e) WH-words: These are interrogative pronouns 
such as who, what, which, when in English. Each 
of these will have a separate tag as a subtype of 
PR. 


7.6.1 Personal Pronouns (PRP) 


Personal pronouns refer to individuals (people, 
animals, etc.) or things. The tag PRP shall be used to 
tag Personal pronouns. 


7.6.2 Reflexive Pronoun (PRF) 


Reflexive pronouns are those words that are used 
to refer to oneself. The tag PRF shall be used to tag 
reflexive pronouns. 


7.6.3 Relative Pronoun (PRL) 


Relative pronouns are words that are part of a clause 
that modifies a noun. The relative pronoun in the clause 
refers to the head noun, the noun that the clause is 
modifying. The tag PRL shall be used to tag Relative 
pronouns. 


7.6.4 Reciprocal Pronoun (PRC) 


A reciprocal pronoun expresses a mutual relationship 
between two entities. Examples of reciprocal pronouns 
are each-other, one another. The tag PRC shall be used 
to tag reciprocal pronouns. 


7.6.5 Interrogative Pronoun/WH-word (PRQ) 


Interrogative pronouns are used for asking questions. 
The tag PRQ shall be used to tag Interrogative pronouns. 


7.6.6 Indefinite Pronoun (PRI) 


Pronouns that refer to non-specific things or persons are 
Indefinite pronouns. The tag PRI shall be used to tag 
indefinite pronouns. 


7.7 Demonstratives (DM) 


The category of Demonstratives is a top level category. 
These pronouns point to a noun and express its 
position as near or far. The tag “DM” shall be used 
to tag demonstratives. The subtypes of demonstrative 
pronouns are relative, interrogative, and indefinite. The 
tags for these would be DMR, DMQ, DMI respectively. 


7.8 Verb (V) 


The tag “V” shall be used to tag Verbs. 


The category of verb has three subtype levels. One 
level distinguishes between the main verb and auxiliary 
verb. The second level is based on the verbal inflections 
denoting finiteness/non-finiteness etc. Gerunds are 
also classified at this level as a subtype of main verb. 


7.8.1 Verb (Main) (VM) 


Languages such as Hindi often have free morphemes for 
verbal inflection. For example, in the verbal sequence 
‘jaa rahaa hai’, ‘rahaa’ and ‘hai’ mark progressive 
aspect and present tense respectively. In this sequence 
the verb ‘jaa’ is the main part whereas ‘rahaa’ and ‘hai’ 
are its auxiliaries. Thus, ‘jaa’ will be assigned the tag 
‘VM’. 

The tag “VM” shall be used to tag Verb (main). 
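

The Hindi example in this clause can be written down as token and tag pairs. The following is a minimal sketch in Python; the pair representation and the helper name split_verb_group are assumptions made for illustration and are not prescribed by this standard.

```python
# (token, tag) pairs for the verbal sequence 'jaa rahaa hai':
# 'jaa' is the main verb, 'rahaa' and 'hai' are its auxiliaries.
tagged = [("jaa", "VM"), ("rahaa", "VAUX"), ("hai", "VAUX")]


def split_verb_group(tagged_tokens):
    """Separate the main verb from its auxiliaries using the tags."""
    main = [tok for tok, tag in tagged_tokens if tag == "VM"]
    aux = [tok for tok, tag in tagged_tokens if tag == "VAUX"]
    return main, aux


main, aux = split_verb_group(tagged)
print(main)  # ['jaa']
print(aux)   # ['rahaa', 'hai']
```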


The main verb has subtypes of Finite, Nonfinite, 
Infinitival and Gerund. 


7.8.1.1 Verb finite (VF) 


A verb inflected for its tense, aspect etc. is a finite verb. 
The tag “VF” shall be used to tag verb Finite. 


7.8.1.2 Verb nonfinite (VNF) 


A verb not inflected for its tense etc. is a nonfinite verb. 
The tag “VNF” shall be used to tag nonfinite verbs. 


7.8.1.3 Verb infinitive (VINF) 


Infinitive type of verb does not get regular verbal 
inflections as it does not function like a verb. In English 
an infinitive verb is preceded by ‘to’. ‘to go’, ‘to eat’, 
‘to sleep’ are some examples of infinitival verbs. Not all 
Indian languages have this category. The tag “VINF” 
shall be used to tag Infinitive verbs. 


7.8.1.4 Gerund (VNG) 


Gerunds are nouns which are derived from verbs. These 
are categorized under the verb because they continue 
to behave like verbs by taking arguments etc. The tag 
“VNG” shall be used to tag Gerunds. 


7.8.2 Auxiliary Verb (VAUX) 


The auxiliary verbs also have subtypes of finite, 
nonfinite, infinitive and gerund, because some Indian 
languages reflect the non-finite properties even in 
auxiliary verbs. The tag “VAUX” shall be used to tag 
Auxiliary verbs. 


7.9 Adjective (JJ) 


Only adjectives denoting some property of a noun 
fall under this category. 


The tag “JJ” shall be used to tag Adjectives. 


7.10 Adverb (RB) 


The term adverb in this scheme refers only to manner 
adverbs. Adverbs of time and space are not to be tagged 
here. The tag “RB” shall be used to tag adverbs. 


7.11 Postposition (PSP) 


The category postposition is used to annotate the 
postpositions and other case markers in languages such 
as Hindi. The tag “PSP” shall be used to tag postpositions. 


7.12 Conjuncts (CC) 


Conjuncts shall be considered as a top level category 
POS Tag with its own sub types of coordinators and 
subordinators. The subordinators have a further sub 
type of ‘quotatives’. The tag “CC” shall be used to tag 
the top level of Conjuncts. 


7.12.1 Co-ordinators (CCD) 


Co-ordinating conjuncts are those words which conjoin 
two independent elements at the same level. The tag 
“CCD” shall be used to tag Co-ordinators. 


7.12.2 Subordinators (CCS) 


Subordinators conjoin a dependent element to its head. 
The tag “CCS” shall be used to tag subordinators. 


7.12.2.1 Quotative (UT) 


Quotatives occur in many languages and have the role 
of conjoining a subordinate clause to the main clause. 
It is left optional to the languages to go to this level of 
granularity or remain at the higher level keeping only 
two-level hierarchy for conjuncts. The tag “UT” shall 
be used to tag Quotatives. 
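

The optional granularity described here can be sketched as truncating an annotation to a chosen depth. The following Python sketch assumes that an annotation is written as its levels separated by a single space, which is how Annex A appears to print them; the function name coarsen and the separator parameter are hypothetical.

```python
def coarsen(annotation, max_depth, sep=" "):
    """Keep at most `max_depth` levels of a hierarchical annotation."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    levels = annotation.split(sep)
    return sep.join(levels[:max_depth])


# A language that stays at the two-level hierarchy for conjuncts
# records a quotative simply as a subordinator.
print(coarsen("CC CCS UT", 2))  # CC CCS
print(coarsen("CC CCD", 2))     # CC CCD
```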


7.13 Particles (RP) 


The category particles is a top level category having 
classifier, intensifier, interjection and negation as its 
subtypes. Although some of these categories inflect 
in some languages, they are nevertheless included as 
subtypes of the category Particles. The tag “RP” shall 
be used to tag particles. 


7.13.1 Particle Default (RPD) 


Particles are words which do not belong to any of the 
major parts of speech. The tag “RPD” shall be used to 
tag default particles. 


7.13.2 Classifier (CL) 


A classifier is a word that comes with nouns and 
“classifies” a noun depending on the type of its referent. 
The noun classes could be based on gender, shape, size 
etc. The tag CL shall be used to tag classifiers. 


7.13.3 Interjection (INJ) 


An interjection is a short utterance (sound/word/phrase) 
which can occur independently (normally before a 
sentence) and conveys emotion. The tag “INJ” shall be 
used to tag Interjections. 


7.13.4 Intensifier (INTF) 


Intensifiers are expressions that intensify the meaning 
of an adverb or an adjective. The tag “INTF” shall be 
used to tag intensifiers. 


7.13.5 Negation (NEG) 


Words that negate a verb (and, in some languages, 
nouns as well) are negations. The tag “NEG” shall be 
used to tag negations. 


7.14 Quantifiers (QT) 


Quantifiers play an important role in several syntactic 
structures and it is important to mark them separately 
for some NLP tasks as well. The tag “QT” shall be used 
to tag the top level of quantifiers. 


The category Quantifiers has three subtypes, which are 
General (QTF), Cardinal (QTC) and Ordinal (QTO). 


The tags for the Quantifiers and its subtypes shall be as 
given in Table 1. 


Table 1 Tags for the Quantifiers and its Subtypes 
( Clause 7.14 ) 


Category/Subtype    Tag 

Quantifiers         QT 
General             QTF 
Cardinal            QTC 
Ordinal             QTO 


7.15 Residuals (RD) 


The category of Residuals is used for annotating 
cases that do not fall in any of the above grammatical 
categories. 

The subtypes under Residuals are Foreign word (RDF), 
Symbol (SYM), Punctuation (PUNC), Unknown 
(UNK) and Echowords (ECH). 

Since an echo word is not a valid word in a given 
language, it has to be tagged separately. 

The tags for the Residuals and its subtypes shall be as 
given in Table 2. 


Table 2 Tags for the Residuals and its Subtypes 
( Clause 7.15 ) 


Category/Subtype    Tag 

Residuals           RD 
Foreign word        RDF 
Symbol              SYM 
Punctuation         PUNC 
Unknown             UNK 
Echowords           ECH 


ANNEX A 
SUPERSET OF POS TAGS 


A-1 SUPERSET OF POS TAGS FOR INDIAN LANGUAGES 


Sl No.  Category                                       Label  Annotation     Remarks
        Top level      Subtype        Subtype                 Convention**
                       (level 1)      (level 2)

1       Noun                                           N
1.1                    Common                          NN     N NN
1.2                    Proper                          NNP    N NNP
1.3                    Verbal                          NNV    N NNV          The verbal noun type is only
                                                                             for languages such as Tamil
                                                                             and Malayalam
1.4                    Nloc                            NST    N NST
2       Pronoun                                        PR     PR
2.1                    Personal                        PRP    PR PRP
2.2                    Reflexive                       PRF    PR PRF
2.3                    Relative                        PRL    PR PRL
2.4                    Reciprocal                      PRC    PR PRC
2.5                    Wh-word                         PRQ    PR PRQ
2.6                    Indefinite                      PRI    PR PRI
3       Demonstrative                                  DM     DM
3.1                    Deictic                         DMD    DM DMD
3.2                    Relative                        DMR    DM DMR
3.3                    Wh-word                         DMQ    DM DMQ
3.4                    Indefinite                      DMI    DM DMI
4       Verb                                           V
4.1                    Main                            VM     V VM
4.1.1                                 Finite           VF     V VM VF
4.1.2                                 Non-finite       VNF    V VM VNF
4.1.3                                 Infinitive       VINF   V VM VINF
4.1.4                                 Gerund           VNG    V VM VNG
4.2                    Verbal                          VN     V VN           paTittam, naTattam, naTanam
4.2                    Auxiliary                       VAUX   V VAUX
4.2.1                                 Finite           VAUX   V VAUX VF
4.2.2                                 Non-finite       VNF    V VAUX VNF
4.2.3                                 Infinitive       VINF   V VAUX VINF
4.2.4                                 Gerund           VNG    V VAUX VNG
4.2.5                                 Participle noun  VNP    V VAUX VNP
5       Adjective                                      JJ
6       Adverb                                         RB                    Only manner adverbs
7       Postposition                                   PSP
8       Conjunction                                    CC     CC
8.1                    Co-ordinator                    CCD    CC CCD
8.2                    Subordinator                    CCS    CC CCS
8.2.1                                 Quotative        UT     CC CCS UT
9       Particles                                      RP     RP
9.1                    Default                         RPD    RP RPD
9.2                    Classifier                      CL     RP CL
9.3                    Interjection                    INJ    RP INJ
9.4                    Intensifier                     INTF   RP INTF
9.5                    Negation                        NEG    RP NEG
10      Quantifiers                                    QT     QT
10.1                   General                         QTF    QT QTF
10.2                   Cardinals                       QTC    QT QTC
10.3                   Ordinals                        QTO    QT QTO
11      Residuals                                      RD     RD
11.1                   Foreign word                    RDF    RD RDF         A word written in script
                                                                             other than the script of
                                                                             the original text
11.2                   Symbol                          SYM    RD SYM         For symbols such as $, & etc.
11.3                   Punctuation                     PUNC   RD PUNC        Only for punctuations
11.4                   Unknown                         UNK    RD UNK
11.5                   Echowords                       ECH    RD ECH


** The annotation is to be done using the lowest level tag of the type hierarchy. Once the lower level tag is selected, the higher level tags 
should be stored automatically. 
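

The footnote above can be sketched as a lookup that walks from the selected tag up to its top level category. The following Python sketch is illustrative only: the PARENT table covers a few rows of Annex A rather than the full table, the space separator follows the way the Annotation Convention column is printed here, and the names PARENT and full_annotation are hypothetical. Labels such as VF occur under both Main and Auxiliary in Annex A, so the sketch includes only the main verb branch; a complete implementation would need extra information to choose between the two.

```python
# Parent of each tag, transcribed from selected rows of Annex A.
# None marks a top level category.
PARENT = {
    "N": None, "NN": "N", "NNP": "N", "NST": "N",
    "V": None, "VM": "V", "VF": "VM", "VNF": "VM",
    "CC": None, "CCD": "CC", "CCS": "CC", "UT": "CCS",
}


def full_annotation(tag, sep=" "):
    """Return the path from the top level tag down to `tag`."""
    path = []
    while tag is not None:
        path.append(tag)
        tag = PARENT[tag]  # raises KeyError for a tag not in the table
    return sep.join(reversed(path))


print(full_annotation("NNP"))  # N NNP
print(full_annotation("VF"))   # V VM VF
print(full_annotation("UT"))   # CC CCS UT
```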


ANNEX B 
LANGUAGE SPECIFIC POS TAGSETS AND LABELS 


B-1 POS TAGSET FOR DRAVIDIAN LANGUAGES (KANNADA, MALAYALAM, TAMIL, AND 
TELUGU) 


Sl No.  Category                                       Label  Annotation     Remarks
        Top level      Subtype        Subtype                 Convention**
                       (level 1)      (level 2)

1       Noun                                           N      N
1.1                    Common                          NN     N NN
1.2                    Proper                          NNP    N NNP
1.3                    Nloc                            NST    N NST
2       Pronoun                                        PR     PR
2.1                    Personal                        PRP    PR PRP
2.2                    Reflexive                       PRF    PR PRF
2.3                    Relative                        PRL    PR PRL
2.4                    Reciprocal                      PRC    PR PRC
2.5                    Wh-word                         PRQ    PR PRQ
3       Demonstrative                                  DM     DM
3.1                    Deictic                         DMD    DM DMD
3.2                    Relative                        DMR    DM DMR
3.3                    Wh-word                         DMQ    DM DMQ
4       Verb                                           V      V
4.1                    Main                            VM     V VM
4.1.1                                 Finite           VF     V VM VF
4.1.2                                 Non-finite       VNF    V VM VNF
4.1.3                                 Infinitive       VINF   V VM VINF
4.1.4                                 Gerund           VNG    V VM VNG
4.2                    Verbal Noun                     NNV    N NNV          Verbal noun
4.3                    Auxiliary                       VAUX   V VAUX
4.1.2                                 Non-finite       VNF    V VM VNF
4.1.3                                 Infinitive       VINF   V VM VNF
5       Adjective                                      JJ
6       Adverb                                         RB                    Only manner adverbs
7       Postposition                                   PSP
8       Conjunction                                    CC     CC
8.1                    Co-ordinator                    CCD    CC CCD
8.2                    Subordinator                    CCS    CC CCS
8.2.1                                 Quotative        UT     CC CCS UT
9       Particles                                      RP     RP
9.1                    Default                         RPD    RP RPD
9.2                    Classifier                      CL     RP CL
9.3                    Interjection                    INJ    RP INJ
9.4                    Intensifier                     INTF   RP INTF
9.5                    Negation                        NEG    RP NEG
10      Quantifiers                                    QT     QT
10.1                   General                         QTF    QT QTF
10.2                   Cardinals                       QTC    QT QTC
10.3                   Ordinals                        QTO    QT QTO
11      Residuals                                      RD     RD
11.1                   Foreign word                    RDF    RD RDF         A word written in script
                                                                             other than the script of
                                                                             the original text
11.2                   Symbol                          SYM    RD SYM         For symbols such as $, & etc.
11.3                   Punctuation                     PUNC   RD PUNC        Only for punctuations
11.4                   Unknown                         UNK    RD UNK
11.5                   Echowords                       ECH    RD ECH
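

A language specific tagset selects its labels from the superset in Annex A, and this relationship can be checked mechanically. The following is a minimal sketch in Python, assuming the labels are transcribed by hand from the Label columns of the two tables; the names SUPERSET, DRAVIDIAN and unknown_tags are hypothetical and not part of this standard.

```python
# Labels transcribed from the Label column of Annex A.
SUPERSET = set("""
N NN NNP NNV NST PR PRP PRF PRL PRC PRQ PRI DM DMD DMR DMQ DMI
V VM VF VNF VINF VNG VN VAUX VNP JJ RB PSP CC CCD CCS UT
RP RPD CL INJ INTF NEG QT QTF QTC QTO RD RDF SYM PUNC UNK ECH
""".split())

# Labels transcribed from the Label column of B-1.
DRAVIDIAN = set("""
N NN NNP NST PR PRP PRF PRL PRC PRQ DM DMD DMR DMQ
V VM VF VNF VINF VNG NNV VAUX JJ RB PSP CC CCD CCS UT
RP RPD CL INJ INTF NEG QT QTF QTC QTO RD RDF SYM PUNC UNK ECH
""".split())


def unknown_tags(language_tags, superset):
    """Return the labels of a language tagset missing from the superset."""
    return sorted(language_tags - superset)


print(unknown_tags(DRAVIDIAN, SUPERSET))  # []
print(sorted(SUPERSET - DRAVIDIAN))       # ['DMI', 'PRI', 'VN', 'VNP']
```

An empty first result means that every label in B-1 also occurs in Annex A; the second line lists the superset labels that B-1 does not use.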